Summary

The analysis of datasets taking the form of simple, undirected graphs continues to gain in
importance across a variety of disciplines. Two choices of null model, the logistic-linear model
and the implicit log-linear model, have come into common use for analyzing such network
data, in part because each accounts for the heterogeneity of network node degrees typically
observed in practice. Here we show how both may be viewed as instances of a broader
class of null models, with the property that all members of this class give rise to essentially the
same likelihood-based estimates of link probabilities in sparse graph regimes. This facilitates
likelihood-based computation and inference, and enables practitioners to choose the most
appropriate null model from this family based on application context. Comparative model fits for a
variety of network datasets demonstrate the practical implications of our results.

Some key words: Approximate likelihood-based inference; Generalized linear model; Network data; Null model; 
Social network analysis; Sparse random graph. 



1. Introduction 



Statisticians have long recognized the importance of so-called null models. There are two main 
uses for null models: (1) they serve as baseline points of comparison for assessing goodness of 
fit (e.g., for score tests and analysis of variance); and (2) they facilitate residuals-based analyses 
(e.g., for exploratory data analysis and outlier detection). 

In contexts where data take the form of a simple, undirected network on $n$ nodes, the model
posited by Erdős & Rényi (1959), considered with edges appearing as independently and
identically distributed Bernoulli trials, is perhaps the simplest possible null model. However, with only
a single parameter, it lacks the ability to capture the extent of degree heterogeneity commonly
associated with network data in practice (Barabási & Albert, 2009). As alternatives, two popular
$n$-parameter models have emerged in the literature, each of which associates a single parameter
(or estimate thereof) to every node, and in doing so allows for heterogeneity of nodal degrees.

The logistic-linear model takes the probability $p_{ij}$ of observing an edge between nodes $i$ and $j$
to be given by

$$\operatorname{logit} p_{ij} = \alpha_i + \alpha_j,$$

where $\alpha = (\alpha_1, \ldots, \alpha_n)$ is a vector of node-specific parameters. Chatterjee et al. (2011) term this
the $\beta$-model; it has also been considered by Park & Newman (2004) and Blitzstein & Diaconis,
and before them by Holland & Leinhardt (1981) in its directed form. See also Hunter
(2004) and Rinaldo et al. (2011), and references therein.



The (implicit) log-linear model instead takes edge probabilities to be given in terms of an
observed binary, symmetric adjacency matrix $X$ as

$$\log \bar p_{ij} = \log X_{i+} + \log X_{j+} - \log X_{++},$$

where $X_{i+} = \sum_{k=1}^{n} X_{ik}$ is the degree of the $i$th node, and $X_{++} = \sum_{i=1}^{n} X_{i+}$ the sum of all
observed degrees. This model is implicit in that its specified edge probabilities depend on the
observed data; thus, it is not a proper null model. It is more accurate to say that each $\bar p_{ij}$ here is
the estimated probability under a model, and that the model has been left unspecified. Girvan &
Newman (2002) take this as a basis for their residuals-based approach to community detection or
nodal partitioning in networks, while Chung et al. (2003) and others have studied its associated
spectral graph properties.

Both the logistic-linear and log-linear models have appealing features. From a statistical
standpoint, the former is more convenient since it is based on the canonical link function, whereas
from an analytical and computational standpoint, the latter is more convenient since the set of
estimated edge probabilities takes the form of an outer product. However, at the same time, the
choice between them remains unsatisfying. Practitioners lack the necessary guidance to judge
which of these two null models is most appropriate in a given context, along with a clear
understanding of the differences between them.

In the sequel we resolve this issue and show that, in the sparse adjacency regimes wherein
network datasets are typically observed, the two models are equivalent for all practical purposes.
Specifically, both models lead to essentially the same parameter estimates $\hat\alpha_i$, the same
probability estimates $\hat p_{ij}$, and the same null log-likelihood $\hat\ell$. In fact, by considering these two models
as members of a broader class of null models, we prove the stronger result that all models in this
family lead to essentially the same maximum likelihood estimates for the parameters of interest.
We emphasize that our results hold irrespective of the data-generating mechanism giving rise to
$X$. Specifically, they hold whenever the node degrees $X_{i+}$ are small relative to the total number
of edges $X_{++}/2$ in the network.



2. Statement of results 
2·1. A family of null models for network data
As above let $X$ be an $n \times n$ binary, symmetric adjacency matrix with zeros along its main
diagonal, corresponding to a simple, undirected graph on $n$ nodes. Consider $X$ to be random
and $\alpha = (\alpha_1, \ldots, \alpha_n)$ a vector of node-specific parameters. We suppose that $X$ has
independent Bernoulli elements above the main diagonal such that $\mathrm{pr}(X_{ij} = 1) = p_{ij}(\alpha)$, and specify a
corresponding family of probabilistic null models for $X$, each parameterized by $\alpha$.

To this end, let $\varepsilon = \{\varepsilon_{ij} : i \neq j\}$ be a family of smooth functions, where each $\varepsilon_{ij}$ maps pairs of real
numbers to real numbers, and $\varepsilon_{ij}(x, y) = \varepsilon_{ji}(y, x)$. Let model $\mathcal{M}_\varepsilon$ then specify $p_{ij}$ as

$$\mathcal{M}_\varepsilon : \log p_{ij} = \alpha_i + \alpha_j + \varepsilon_{ij}(\alpha_i, \alpha_j), \tag{2.1}$$

so that we obtain a class of log-linear models indexed by $\varepsilon$.

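Observe that with $\varepsilon_{ij}(\alpha_i, \alpha_j)$ identically zero we recover the log-linear null model alluded to
in the Introduction, explicitly parameterized by $\alpha$. In fact, this class encompasses three common
choices of link function:

$$\mathcal{M}_{\log} : \log p_{ij} = \alpha_i + \alpha_j, \tag{2.2a}$$
$$\mathcal{M}_{\mathrm{cloglog}} : \log(-\log(1 - p_{ij})) = \alpha_i + \alpha_j, \tag{2.2b}$$
$$\mathcal{M}_{\mathrm{logit}} : \operatorname{logit} p_{ij} = \alpha_i + \alpha_j. \tag{2.2c}$$

To see this, set $\varepsilon_{ij}(\alpha_i, \alpha_j) = \log\{1 - \exp(-e^{\alpha_i + \alpha_j})\} - (\alpha_i + \alpha_j)$ for the complementary
log-log link model $\mathcal{M}_{\mathrm{cloglog}}$, and for the model $\mathcal{M}_{\mathrm{logit}}$, set
$\varepsilon_{ij}(\alpha_i, \alpha_j) = -\log\{1 + \exp(\alpha_i + \alpha_j)\}$.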
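The two substitutions can be checked numerically. The following Python sketch is an illustration only: the function names and the test value of $\alpha_i + \alpha_j$ are hypothetical choices, and each perturbation is written as a function of the sum $\eta = \alpha_i + \alpha_j$, which is all that these three links require. It evaluates $p_{ij}$ through (2.1) and confirms that the corresponding link in (2.2) returns $\eta$.

```python
import math

def eps_log(eta):
    """Perturbation for the log link: identically zero."""
    return 0.0

def eps_cloglog(eta):
    """Perturbation for the complementary log-log link:
    log{1 - exp(-e^eta)} - eta, using expm1 for accuracy when e^eta is small."""
    return math.log(-math.expm1(-math.exp(eta))) - eta

def eps_logit(eta):
    """Perturbation for the logit link: -log{1 + exp(eta)}."""
    return -math.log1p(math.exp(eta))

def edge_prob(eta, eps):
    """Edge probability from (2.1): log p = eta + eps(eta), eta = alpha_i + alpha_j."""
    return math.exp(eta + eps(eta))

eta = -3.0  # hypothetical value of alpha_i + alpha_j
p_log = edge_prob(eta, eps_log)
p_cloglog = edge_prob(eta, eps_cloglog)
p_logit = edge_prob(eta, eps_logit)

# Each link function applied to its own p should recover eta.
print(math.log(p_log))
print(math.log(-math.log(1.0 - p_cloglog)))
print(math.log(p_logit / (1.0 - p_logit)))
```

For small $e^{\eta}$ the three probabilities are close to one another, which is the regime in which the perturbation terms are negligible.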



As we have seen, the logit-link model $\mathcal{M}_{\mathrm{logit}}$ is an undirected version of Holland &
Leinhardt's (1981) exponential family random graph model. As noted by Chatterjee et al. (2011),
the degree sequence of $X$ is sufficient for $\alpha$ in this case, formalizing the null-model intuition
that all graphs exhibiting the same degree sequence be considered as equally likely. The log-link
model $\mathcal{M}_{\log}$ can be considered as an alternate parametrization of Chung & Lu's (2002) expected
degree model, with the additional constraint that self-loops of the form $X_{ii} = 1$ are explicitly
disallowed. The complementary log-log model $\mathcal{M}_{\mathrm{cloglog}}$ has not seen application in the network
literature to date, but the same functional form appears commonly in the context of generalized
linear models (McCullagh & Nelder, 1989).



2·2. Properties

Before placing conditions on $\varepsilon$, it is instructive to consider key properties of the family of
models that can be written in the form $\mathcal{M}_\varepsilon$. Observe first that the expected degree of node $i$ is

$$E(X_{i+}) = \sum_{j \neq i} p_{ij}. \tag{2.3}$$

Thus, for models in this class, higher values of $\alpha_i$ lead to higher expected degrees for node $i$
whenever $\varepsilon$ is small. It is therefore natural to posit as a simple estimator of $\alpha$ a monotone
transformation of the degree sequence specified by $X$:

$$\bar\alpha_i = \log X_{i+} - \log \sqrt{X_{++}}. \tag{2.4}$$

This is equivalent to estimating $p_{ij}$ via

$$\bar p_{ij} = \frac{X_{i+} X_{j+}}{X_{++}}, \tag{2.5}$$

which corresponds precisely to the implicit log-linear model described in the Introduction, and
may also be viewed in light of (2.3) as an approximate moment-matching technique.
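To make the moment-matching interpretation concrete, the sketch below computes the degree-based estimates (2.4) and (2.5) for a small graph and compares the fitted expected degrees from (2.3) with the observed ones. The edge list and variable names are hypothetical and chosen only for illustration; every node is assumed to have nonzero degree so that the logarithms are defined.

```python
import math

# Hypothetical example: a small undirected graph given as an edge list.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]
n = 5

degree = [0] * n
for i, j in edges:
    degree[i] += 1
    degree[j] += 1
total = sum(degree)  # X_{++}, twice the number of edges

# Degree-based parameter estimates, as in (2.4).
alpha_bar = [math.log(d) - 0.5 * math.log(total) for d in degree]

def p_bar(i, j):
    """Estimated edge probability exp(alpha_bar_i + alpha_bar_j), as in (2.5)."""
    return math.exp(alpha_bar[i] + alpha_bar[j])

# Fitted expected degrees from (2.3), summing over j != i.
fitted = [sum(p_bar(i, j) for j in range(n) if j != i) for i in range(n)]

for d, f in zip(degree, fitted):
    print(d, round(f, 4))
```

Each fitted degree equals $X_{i+}(1 - X_{i+}/X_{++})$, so the match with the observed degree is close whenever $X_{i+}$ is small relative to $X_{++}$.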



We show below that $\bar\alpha$ as defined in (2.4) suffices as an estimator for this model class in the
sparse graph regime. To gain intuition into this claim, consider the corresponding likelihood
function as follows.

The log-likelihood for any simple, undirected graph model with independent edges is

$$\ell = \log\Bigl\{\prod_{i<j} p_{ij}^{X_{ij}} (1 - p_{ij})^{1 - X_{ij}}\Bigr\} = \sum_{i<j} \bigl\{X_{ij} \log p_{ij} + (1 - X_{ij}) \log(1 - p_{ij})\bigr\}. \tag{2.6}$$

Intuitively, if $p_{ij}$ is small, then a Bernoulli random variable with mean $p_{ij}$ behaves like a Poisson
random variable having the same mean. In a rough sense, this Bernoulli log-likelihood is close
to one under which $X_{ij}$ is treated as a Poisson random variable:

$$\ell_{\mathrm{Pois}} = \log\Bigl\{\prod_{i<j} p_{ij}^{X_{ij}} \exp(-p_{ij})\Bigr\} = \sum_{i<j} \bigl(X_{ij} \log p_{ij} - p_{ij}\bigr),$$

up to a constant shift depending on $X$.
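A small numerical experiment illustrates how the Bernoulli and Poisson log-likelihoods approach one another as edge probabilities shrink. This is a sketch under simplifying assumptions: the indicator vector `x`, the common probability `scale`, and the function names are hypothetical, and the comparison uses the relative difference between the two log-likelihoods.

```python
import math

def bernoulli_loglik(x, p):
    """Sum of x log p + (1 - x) log(1 - p), as in (2.6)."""
    return sum(xi * math.log(pi) + (1 - xi) * math.log1p(-pi) for xi, pi in zip(x, p))

def poisson_loglik(x, p):
    """Sum of x log p - p, the Poisson analogue (constant terms omitted)."""
    return sum(xi * math.log(pi) - pi for xi, pi in zip(x, p))

# Hypothetical edge indicators for eight node pairs, two of which are edges.
x = [1, 0, 0, 1, 0, 0, 0, 0]

rel_gaps = []
for scale in (0.2, 0.02, 0.002):
    p = [scale] * len(x)
    bern = bernoulli_loglik(x, p)
    pois = poisson_loglik(x, p)
    rel_gaps.append(abs(bern - pois) / abs(bern))

# The relative gap shrinks as the common edge probability decreases.
print(rel_gaps)
```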

Substituting the log-linear model parametrization $\mathcal{M}_{\log}$ of (2.2a), which corresponds to the
canonical link under Poisson sampling, we obtain

$$\ell_{\mathrm{Pois}}(\alpha) = \sum_{i<j} \bigl\{X_{ij}(\alpha_i + \alpha_j) - \exp(\alpha_i + \alpha_j)\bigr\} = \sum_{i=1}^{n} \alpha_i X_{i+} - \sum_{i<j} \exp(\alpha_i + \alpha_j),$$

and thus, the solution to the Poisson likelihood equation $\nabla \ell_{\mathrm{Pois}}(\alpha) = 0$ satisfies

$$X_{i+} = \sum_{j \neq i} \exp(\alpha_i + \alpha_j) \qquad (i = 1, \ldots, n).$$

When $\alpha$ is set to $\bar\alpha$ as defined in (2.4), the right-hand side becomes

$$\sum_{j \neq i} \frac{X_{i+} X_{j+}}{X_{++}} = X_{i+}\Bigl(1 - \frac{X_{i+}}{X_{++}}\Bigr),$$

and so we see that each component of $\nabla \ell_{\mathrm{Pois}}(\bar\alpha)$ is precisely $X_{i+}^2 / X_{++}$. Hence, when this
quantity is small for every $i$, we can expect that both $\bar\alpha$ and correspondingly $\bar p$ are close to their
respective maximum likelihood estimates. We formalize this notion as follows.
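The identity for the Poisson score at the degree-based estimate is easy to confirm numerically. In the sketch below, the degree sequence and the helper name `poisson_score` are hypothetical; the score for node $i$ is computed directly as $X_{i+} - \sum_{j \neq i} \exp(\alpha_i + \alpha_j)$ and compared with $X_{i+}^2 / X_{++}$.

```python
import math

# Hypothetical degree sequence with all degrees nonzero.
degree = [1, 1, 2, 2, 3, 3, 4]
total = sum(degree)  # X_{++}
n = len(degree)

# Degree-based estimate: alpha_bar_i = log X_{i+} - log sqrt(X_{++}).
alpha_bar = [math.log(d) - 0.5 * math.log(total) for d in degree]

def poisson_score(alpha, degree):
    """Gradient of the Poisson log-likelihood under the log link."""
    m = len(alpha)
    return [
        degree[i] - sum(math.exp(alpha[i] + alpha[j]) for j in range(m) if j != i)
        for i in range(m)
    ]

score = poisson_score(alpha_bar, degree)
expected = [d * d / total for d in degree]

for s, e in zip(score, expected):
    print(round(s, 6), round(e, 6))
```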

2·3. Approximation results for maximum likelihood inference
Our main result is an approximation theorem for likelihood-based inference under models
taking the form of $\mathcal{M}_\varepsilon$ from (2.1). Under suitable sparsity constraints and for many choices of $\varepsilon$,
including those given by (2.2), a maximum likelihood estimate of each parameter $\alpha_i$ exists and
is close to $\bar\alpha_i$ as defined in (2.4). Furthermore, the corresponding maximum likelihood estimate
of each edge probability $p_{ij}$ is close to $\bar p_{ij}$, defined in (2.5), and the null log-likelihood under $\mathcal{M}_\varepsilon$
evaluated at $\bar p$ is close to that evaluated at the corresponding maximum likelihood estimate of $p$.
These approximation results hold for all $\varepsilon$ satisfying the following condition.

ASSUMPTION 1. For all pairs $i, j$ and all choices of $k$, $l$, and $m$, the functions $\varepsilon_{ij}$, $\partial \varepsilon_{ij} / \partial \alpha_k$,
$\partial^2 \varepsilon_{ij} / (\partial \alpha_k \partial \alpha_l)$, and $\partial^3 \varepsilon_{ij} / (\partial \alpha_k \partial \alpha_l \partial \alpha_m)$ are sub-exponential in $\alpha_i + \alpha_j$. That is, there exists
a constant $C_0$ such that the absolute values of these functions are bounded by $C_0 \exp(\alpha_i + \alpha_j)$.

Recall model $\mathcal{M}_{\log}$ from (2.2a), for which $\varepsilon$ is identically zero and thus satisfies Assumption 1
with $C_0 = 0$. One can show that the specifications of $\varepsilon$ arising in the $\mathcal{M}_{\mathrm{cloglog}}$ and $\mathcal{M}_{\mathrm{logit}}$ models,
from (2.2b) and (2.2c) respectively, satisfy Assumption 1 with $C_0$ equal to $1/2$ and $1$.
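The stated constants can be spot-checked for the functions $\varepsilon_{ij}$ themselves. The sketch below is a partial numerical check only: it evaluates $|\varepsilon_{ij}| / \exp(\alpha_i + \alpha_j)$ on a hypothetical grid of values of $\eta = \alpha_i + \alpha_j$ and does not examine the partial derivatives that Assumption 1 also covers. The grid and function names are illustrative choices.

```python
import math

def eps_cloglog(eta):
    """log{1 - exp(-e^eta)} - eta, the cloglog perturbation."""
    return math.log(-math.expm1(-math.exp(eta))) - eta

def eps_logit(eta):
    """-log{1 + exp(eta)}, the logit perturbation."""
    return -math.log1p(math.exp(eta))

# Hypothetical grid of eta = alpha_i + alpha_j values from -6 to 4.
grid = [k / 4.0 for k in range(-24, 17)]

# Largest observed ratio |eps(eta)| / exp(eta) on the grid.
ratio_cloglog = max(abs(eps_cloglog(e)) / math.exp(e) for e in grid)
ratio_logit = max(abs(eps_logit(e)) / math.exp(e) for e in grid)

print(ratio_cloglog)  # compare with C_0 = 1/2
print(ratio_logit)    # compare with C_0 = 1
```

On this grid both ratios are largest at the most negative $\eta$, that is, in the sparse regime, where they approach $1/2$ and $1$ from below.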



Our sparsity requirement is that each component $X_{i+}^2 / X_{++}$ of $\nabla \ell_{\mathrm{Pois}}(\bar\alpha)$ be sufficiently small;
for example, at most $15^{-2}$ in the case of the log-link model $\mathcal{M}_{\log}$. We then have the following.

THEOREM 1. Suppose $X$ is an $n \times n$ adjacency matrix such that $1 \le X_{i+}^2 \le \varepsilon_0 X_{++}$ for all
$i$. For some set of smooth functions $\varepsilon$ satisfying Assumption 1, let model $\mathcal{M}_\varepsilon$ with parameter
vector $\alpha$ in $\mathbb{R}^n$ specify edge probability $p_{ij} = p_{ij}(\alpha)$ as in (2.1). Let $\bar\alpha$ be as defined in (2.4).
Define $\bar\varepsilon_0 = \{15 (C_0 + 1)\}^{-2}$ and $C = 10 (C_0 + 1)$, where $C_0$ is as in Assumption 1. If $\varepsilon_0 \le \bar\varepsilon_0$,
then there exists a solution to the likelihood equation, $\hat\alpha$, such that

$$\|\hat\alpha - \bar\alpha\|_\infty \le C \varepsilon_0.$$

As shown in the Appendix, the corresponding approximation result for the maximum
likelihood estimate of $p_{ij}$ is a straightforward consequence, and an approximation for the
log-likelihood itself also follows.

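COROLLARY 1. Suppose the conditions of Theorem 1 hold and that $\hat p_{ij} = p_{ij}(\hat\alpha)$. If $\varepsilon_0 \le \bar\varepsilon_0$,
then

$$\frac{|\hat p_{ij} - \bar p_{ij}|}{\bar p_{ij}} \le C_1 \varepsilon_0,$$

where $C_1 = 24 (C_0 + 1)$.

COROLLARY 2. Suppose the conditions of Theorem 1 hold, that $\hat\ell$ is the log-likelihood under
$\mathcal{M}_\varepsilon$, evaluated at $\hat p$, and that $\bar\ell$ is defined analogously as

$$\bar\ell = \sum_{i<j} \bigl\{X_{ij} \log \bar p_{ij} + (1 - X_{ij}) \log(1 - \bar p_{ij})\bigr\}.$$

If $\varepsilon_0 \le \bar\varepsilon_0$, then

$$\frac{|\hat\ell - \bar\ell|}{|\bar\ell|} \le C_2 \varepsilon_0,$$

where $C_2 = 49 (C_0 + 1)$.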

Notably, the results of Theorem 1 and Corollaries 1 and 2 are not probabilistic. The only
assumption on $X$ is that the nodal degrees are nonzero and small relative to the total number of
edges. Thus, these results hold even when the true model for $X$ is not specified by $\mathcal{M}_\varepsilon$; i.e., they
are robust to model misspecification.
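Because the hypotheses involve only the degree sequence, they can be checked directly for a given dataset. The following sketch is illustrative: the helper names and the example degree sequence (a cycle on 1000 nodes) are hypothetical, while the constants follow the statements of Theorem 1 and Corollaries 1 and 2 with $C_0$ equal to $0$, $1/2$ and $1$ for the log, cloglog and logit links.

```python
def theorem_constants(c0):
    """Threshold and constants from Theorem 1 and Corollaries 1 and 2."""
    return {
        "eps_bar": (15.0 * (c0 + 1.0)) ** -2,  # sparsity threshold
        "C": 10.0 * (c0 + 1.0),                # bound on alpha
        "C1": 24.0 * (c0 + 1.0),               # bound on edge probabilities
        "C2": 49.0 * (c0 + 1.0),               # bound on the log-likelihood
    }

def sparsity_level(degree):
    """Smallest eps_0 with X_{i+}^2 <= eps_0 * X_{++} for every node."""
    total = sum(degree)
    return max(d * d for d in degree) / total

# Hypothetical example: a cycle on 1000 nodes, so every degree equals 2.
degree = [2] * 1000
eps0 = sparsity_level(degree)

results = {}
for name, c0 in (("log", 0.0), ("cloglog", 0.5), ("logit", 1.0)):
    consts = theorem_constants(c0)
    results[name] = {
        "applies": min(degree) >= 1 and eps0 <= consts["eps_bar"],
        "alpha_bound": consts["C"] * eps0,
    }
    print(name, results[name])
```

In this example the sparsity condition holds for the log link but narrowly fails for the cloglog and logit links, whose thresholds are smaller.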



3. Proof of Theorem 1

We now outline the proof of Theorem 1, deferring requisite technical lemmas to the Appendix.
We employ Kantorovich's (1948) analysis of Newton's method, specifically the optimal error
bounds given by Gragg & Tapia (1974), whose notation we adopt below for ease of reference.

Our strategy is to use the Kantorovich Theorem to establish the existence of a maximum
likelihood estimate $\hat\alpha$ of $\alpha$ in a neighborhood of $\bar\alpha$, which in turn can be obtained by applying
Newton's method with $\bar\alpha$ as the initial point. If we are able to establish the necessary hypotheses,
then this theorem will enable us to bound the distance between $\hat\alpha$ and $\bar\alpha$ as required. To apply it
we require a Lipschitz condition on the Jacobian of the corresponding system of equations near
$\bar\alpha$, as well as boundedness conditions on the inverse Hessian evaluated at $\bar\alpha$ and also the initial
step size of Newton's method from $\bar\alpha$.

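As we show below, the key to these conditions is an approximation of the Hessian by a
diagonal-plus-rank-1 matrix formed from $\bar\alpha$. First, recall the data log-likelihood under $\mathcal{M}_\varepsilon$
from (2.6); its gradient and Hessian with respect to $\alpha$ may be written component-wise as

$$\frac{\partial \ell}{\partial \alpha_k} = \sum_{j \neq k} \bigl(X_{kj} - e^{\alpha_k + \alpha_j}\bigr) + \sum_{j \neq k} X_{kj} f_{kj} + \sum_{j \neq k} e^{\alpha_k + \alpha_j} \tilde f_{kj},$$

$$\frac{\partial^2 \ell}{\partial \alpha_k \partial \alpha_l} =
\begin{cases}
-e^{\alpha_k + \alpha_l} + X_{kl} \dfrac{\partial f_{kl}}{\partial \alpha_l} + e^{\alpha_k + \alpha_l} \Bigl(\tilde f_{kl} + \dfrac{\partial \tilde f_{kl}}{\partial \alpha_l}\Bigr) & \text{if } k \neq l, \\[2ex]
-\sum_{j \neq k} e^{\alpha_k + \alpha_j} + \sum_{j \neq k} X_{kj} \dfrac{\partial f_{kj}}{\partial \alpha_k} + \sum_{j \neq k} e^{\alpha_k + \alpha_j} \Bigl(\tilde f_{kj} + \dfrac{\partial \tilde f_{kj}}{\partial \alpha_k}\Bigr) & \text{if } k = l,
\end{cases}$$

where

$$f_{ij} = f_{ij}(\alpha_i, \alpha_j) = \frac{\partial \varepsilon_{ij}}{\partial \alpha_i} + \frac{p_{ij}}{1 - p_{ij}} \Bigl(1 + \frac{\partial \varepsilon_{ij}}{\partial \alpha_i}\Bigr),$$

$$\tilde f_{ij} = \tilde f_{ij}(\alpha_i, \alpha_j) = 1 - \exp(\varepsilon_{ij}) - \exp(\varepsilon_{ij}) f_{ij}.$$

The form of these expressions suggests that when $\varepsilon_{ij}$ and $p_{ij}$ are small, and $f_{ij}$, $\tilde f_{ij}$ and their
derivatives controlled, an approximation of $\nabla^2 \ell(\alpha)$ based on terms $\bar p_{ij} = \exp(\bar\alpha_i + \bar\alpha_j)$ will
be effective in a neighborhood of $\bar\alpha$. Defining such a neighborhood parameterized by $r > 1$ as
$\mathcal{N}_r = \{\alpha : \|\alpha - \bar\alpha\|_\infty \le (\log r)/2\}$, Lemmas 1-5 in the Appendix provide the necessary
approximation bounds as a function of $r$.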

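Now define vector $d = (d_1, \ldots, d_n)$ with $d_i = X_{i+}$, and, setting $D = \mathrm{diag}(d)$, write

$$\nabla^2 \ell(\bar\alpha) = H(I + E),$$

where we choose $H$ to be

$$H = -\Bigl(D + \frac{1}{X_{++}}\, d d^{\mathrm T}\Bigr).$$

The Sherman-Morrison formula gives

$$H^{-1} = -\Bigl(D^{-1} - \frac{1}{2 X_{++}}\, 1 1^{\mathrm T}\Bigr),$$

where $1$ denotes the vector of ones, and with this expression we may bound the norm of $E$, according to Lemma 6 in the Appendix.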
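For $H = -(D + d d^{\mathrm T} / X_{++})$, the Sherman-Morrison formula yields the closed form $H^{-1} = -\{D^{-1} - 1 1^{\mathrm T} / (2 X_{++})\}$, since $D^{-1} d$ is the vector of ones and $d^{\mathrm T} D^{-1} d = X_{++}$. The sketch below checks this numerically by multiplying the two matrices; the degree vector and variable names are hypothetical.

```python
# Hypothetical degree vector d with nonzero entries; X_{++} is its sum.
degree = [1, 2, 2, 3]
total = sum(degree)
n = len(degree)

# H = -(D + d d^T / X_{++}), with D = diag(d).
H = [
    [-((degree[i] if i == j else 0.0) + degree[i] * degree[j] / total) for j in range(n)]
    for i in range(n)
]

# Closed-form inverse: H^{-1} = -(D^{-1} - 1 1^T / (2 X_{++})).
H_inv = [
    [-((1.0 / degree[i] if i == j else 0.0) - 1.0 / (2.0 * total)) for j in range(n)]
    for i in range(n)
]

# The product should be the identity matrix up to rounding error.
product = [
    [sum(H[i][k] * H_inv[k][j] for k in range(n)) for j in range(n)]
    for i in range(n)
]

for row in product:
    print([round(v, 12) for v in row])
```

The practical point is that inverting the approximate Hessian costs only $O(n)$ storage and arithmetic per matrix-vector product, since no dense inverse needs to be formed.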

To conclude the proof of Theorem 1, consider the system of equations $F(x) = D^{-1} \nabla \ell(x)$
with derivative matrix $F'(x) = D^{-1} [\nabla^2 \ell(x)]$. Equipped with the norm $\|\cdot\|_\infty$, Lemmas 7 and 8
then establish the bounding constants $\kappa$ and $\delta$, and Lemma 9 the Lipschitz constant $\lambda$,
necessary to apply Kantorovich's result, taking initial Newton iterate $x_0 = \bar\alpha$ and defining subsequent
iterates recursively by $x_{k+1} = x_k - [F'(x_k)]^{-1} F(x_k) = x_k - [\nabla^2 \ell(x_k)]^{-1} \nabla \ell(x_k)$. Lemmas 7
and 8 require $\|E\|_\infty < 1$, and Lemma 8 further relates $\delta = L_1 \kappa \varepsilon_0$, with constant $L_1$ defined in
Lemma 4.

If we define $h = 2 \kappa \lambda \delta$ and $t^* = (2/h)(1 - \sqrt{1 - h})\, \delta$, then Kantorovich's Theorem asserts
that when $h \le 1$ and $t^* \le (\log r)/2$, each iterate $x_k$ is in $\mathcal{N}_r$ and $x^*$ is well defined, in the sense
that as $k$ increases, $x_k$ converges to a limit point $x^*$ such that $F(x^*) = 0$. The matrix $D$ is of
full rank, and so in this case $\nabla \ell(x^*) = 0$ as well, implying that $x^*$ is a solution to the likelihood
equation. Thus we will take $\hat\alpha = x^*$, and the existence of a maximum likelihood estimate will be
established if we can show that $h \le 1$ and $t^* \le (\log r)/2$.

To show this, set $r = \exp(4\delta)$, so that $t^* \le 2\delta = (\log r)/2$. It is then straightforward
to verify that if $\varepsilon_0 \le \bar\varepsilon_0 = \{15 (C_0 + 1)\}^{-2}$, where $C_0$ is as given in Assumption 1, then
$\|E\|_\infty \le 10 (C_0 + 1)\, \varepsilon_0 < 1$, satisfying the requirements of Lemmas 7 and 8. Moreover, if
also $r = \exp(4\delta)$, then $\lambda \le 16 (C_0 + 1)$, $h < 1$, and $L_1 \kappa \le 5 (C_0 + 1)$. By Gragg & Tapia
(1974), $\|x^* - x_k\|_\infty \le 2^{-k+1} \|x_1 - x_0\|_\infty$, and so the result of Theorem 1 then follows, since
$\|x_1 - x_0\|_\infty \le \delta = L_1 \kappa \varepsilon_0$.



4. Discussion 
4·1. Implications

The main implication of the above results is that a broad class of null models for undirected
networks give rise to roughly the same maximum likelihood estimates of edge probabilities. For
practitioners looking to capture degree heterogeneity in their null models, then, this provides
verifiable assurances that the particular choice of null model will not give meaningfully different
conclusions, provided the null model can be written in the form $\mathcal{M}_\varepsilon$ such that Assumption 1 is
satisfied, and the dataset is sufficiently sparse. Empirically, as we show below, our approximation
bounds and sparsity conditions appear conservative in practice.

When comparing to the extant literature, two recent results warrant discussion. First is that of
Chatterjee et al. (2011), in which the authors show that a unique maximum likelihood estimate
exists with high probability when the model $\mathcal{M}_{\mathrm{logit}}$ is in force, and give an iterative algorithm that
converges geometrically quickly when a solution to the likelihood equation exists. In contrast,
our results are deterministic, and do not require the data to be generated by any particular model.

Rinaldo et al. (2011) also address the model $\mathcal{M}_{\mathrm{logit}}$ and its version for directed graphs,
focusing on necessary and sufficient conditions for the existence of a maximum likelihood estimate,
and the failure thereof, as a function of the polytope of admissible degree sequences for a given
network size. As with Chatterjee et al. (2011), however, their existence results are probabilistic
in nature; one interpretation of our results in this context is that our sparsity conditions are
sufficient to avoid the pathological degree polytope conditions that give rise to the many nonexistence
examples considered by Rinaldo et al. (2011).



4·2. Empirical evaluation
To evaluate how conservative our sparsity condition and bound on the universal constant $C$
in Theorem 1 appear in practice, we fitted the models $\mathcal{M}_{\log}$, $\mathcal{M}_{\mathrm{cloglog}}$, and $\mathcal{M}_{\mathrm{logit}}$ defined in
(2.2a)-(2.2c) to nine different network datasets of sizes ranging from $n = 34$ to $n = 7610$:

Zachary (1977)            Social ties within a college karate club
Girvan & Newman (2002)    Network of American football games between Division IA colleges
Hummon et al. (1990)      Citations among scholarly papers on the subject of centrality in networks
Gleiser & Danon (2003)    Collaborations between jazz musicians
Duch & Arenas (2005)      Metabolic network of C. elegans
Adamic & Glance (2005)    Hyperlinks between weblogs on US politics
Newman (2006)             Coauthorships amongst researchers on network theory and experiments
Watts & Strogatz (1998)   Topology of the Western States Power Grid of the U.S.A.
Newman (2001)             Coauthorships among postings to a High-Energy Theory preprint archive

In each case we obtained a maximum-likelihood estimate $\hat\alpha$ and compared it to $\bar\alpha$. Figure 1 shows
the approximation errors in $\hat\alpha_i$ for the dataset analyzed by Newman (2001) as a typical example,
and Table 2 summarizes results for all nine datasets.

Two empirical confirmations of Theorem 1 are that for these datasets and models, the
supremum norm distances $\|\hat\alpha - \bar\alpha\|_\infty$ are of order $C \varepsilon_0$, while the Euclidean norm distances $\|\hat\alpha - \bar\alpha\|_2$
are of order $\sqrt{n}\, C \varepsilon_0$, with $\varepsilon_0$ taken to be $\max_i X_{i+}^2 / X_{++}$ in each case. Here the corresponding
constants appear conservative by factors of order $10^3$ and $10^2$, respectively, suggesting that our
results may in fact hold under less stringent sparsity conditions.
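A comparison of this kind can be sketched in a few lines for the logit link, where the likelihood equations depend on $X$ only through the degree sequence. The code below is an illustrative sketch rather than the procedure used to produce the reported results: the function names, the fixed number of Newton steps, and the example (a cycle on 30 nodes) are assumptions, and a plain Gaussian elimination routine stands in for a numerical library.

```python
import math

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = s / m[r][r]
    return x

def fit_logit(degree, steps=25):
    """Newton's method for the logit-link likelihood equations,
    started at the degree-based estimate alpha_bar."""
    n = len(degree)
    total = sum(degree)
    alpha_bar = [math.log(d) - 0.5 * math.log(total) for d in degree]
    alpha = alpha_bar[:]
    for _ in range(steps):
        # Edge probabilities; the diagonal is set to zero (no self-loops).
        p = [[0.0 if i == j else 1.0 / (1.0 + math.exp(-(alpha[i] + alpha[j])))
              for j in range(n)] for i in range(n)]
        grad = [degree[i] - sum(p[i]) for i in range(n)]
        # Negative Hessian: p(1-p) off the diagonal, row sums on the diagonal.
        neg_hess = [[p[i][j] * (1.0 - p[i][j]) if i != j
                     else sum(p[i][k] * (1.0 - p[i][k]) for k in range(n))
                     for j in range(n)] for i in range(n)]
        step = solve(neg_hess, grad)
        alpha = [a + s for a, s in zip(alpha, step)]
    return alpha_bar, alpha

# Hypothetical example: a cycle on 30 nodes, so every degree equals 2.
alpha_bar, alpha_hat = fit_logit([2] * 30)
gap = max(abs(a - b) for a, b in zip(alpha_hat, alpha_bar))
print(gap)
```

In this example $\varepsilon_0 = 4/60$ exceeds the threshold $\bar\varepsilon_0$ for the logit link, so Theorem 1 does not formally apply, yet the observed gap is still far below $C \varepsilon_0$.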



4·3. Avenues for future work
The simple, well known, and computationally convenient estimator featured in our results has
conceptual as well as computational advantages. As a monotone transformation of the observed
degree sequence, it can also be seen to yield a parametric interpretation of the classical degree
centrality ranking metric common in social network analysis. While we do not pursue this
approach further here, we also note that the approximation to the Fisher information arising in
our proof of Theorem 1 can be used to obtain an approximate asymptotic covariance expression
in this setting, avoiding a requisite matrix inversion that may be prohibitively costly for large
datasets.

More generally, we expect that the methods used in the proof of Theorem 1 will also find
application in investigations of alternative network models. For example, one can show that a
variant of Assumption 1 holds for the degree-corrected blockmodel of Karrer & Newman (2011)
and variations thereof. We therefore surmise that it may well be possible to establish similar
universality results for these models.

Fig. 1. Scaled approximation error $(\hat\alpha_i - \bar\alpha_i)/(C \varepsilon_0)$ plotted as a function of degree $X_{i+}$ for the network analyzed
by Newman (2001), using the cloglog, log, and logit links.

Table 2. Approximation error in terms of $\varepsilon_0$ for several datasets. Percentage valid is defined
as $(100/n) \sum_{i=1}^{n} I(X_{i+}^2 / X_{++} \le \bar\varepsilon_0)$. For Theorem 1 to apply, this value should be equal to 100;
however, the corresponding approximation results hold across the range of datasets considered.

Dataset                   $n$    $X_{++}$  $\max_i X_{i+}$  Link     Valid %  $\|\hat\alpha-\bar\alpha\|_2$  $\|\hat\alpha-\bar\alpha\|_\infty$
Zachary (1977)            34     156       17               cloglog           0.004                          0.01
                                                            log               0.006                          0.02
                                                            logit             0.009                          0.03
Girvan & Newman (2002)    115    1226      12               cloglog           0.02                           0.02
                                                            log               0.005                          0.01
                                                            logit             0.02                           0.03
Hummon et al. (1990)      118    1226      66               cloglog  10       0.003                          0.01
                                                            log      19       0.002                          0.01
                                                            logit    10       0.004                          0.02
Gleiser & Danon (2003)    198    5484      100              cloglog  6        0.004                          0.02
                                                            log      7        0.002                          0.02
                                                            logit    4        0.005                          0.02
Duch & Arenas (2005)      453              237              cloglog  5        0.0005                         0.004
                                                            log      36       0.0006                         0.009
                                                            logit    5        0.0006                         0.005
Adamic & Glance (2005)    1224   33430     351              cloglog  42       0.0009                         0.006
                                                            log      50       0.001                          0.02
                                                            logit    38       0.002                          0.01
Newman (2006)             1461   5484      34               cloglog  63       0.002                          0.01
                                                            log      75       0.003                          0.02
                                                            logit    46       0.001                          0.01
Watts & Strogatz (1998)   4941   13188     19               cloglog  93       0.001                          0.01
                                                            log      97       0.002                          0.02
                                                            logit    80       0.001                          0.01
Newman (2001)             7610   31502     50               cloglog  87       0.0009                         0.01
                                                            log      94       0.001                          0.02
                                                            logit    78       0.0008                         0.009
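Appendix

Technical Lemmas for Proof of Theorem 1
Recall our earlier definition of $\mathcal{N}_r = \{\alpha : \|\alpha - \bar\alpha\|_\infty \le (\log r)/2\}$, defining a neighborhood of $\bar\alpha$
parameterized by $r > 1$. For $\alpha \in \mathcal{N}_r$, Lemmas 1-3 provide bounds on $\varepsilon_{ij}$ and $p_{ij}$, as well as their first three
partial derivatives, along with $f_{ij}$, $\tilde f_{ij}$ and their partials. Lemmas 4 and 5 provide approximations for
derivatives of the log-likelihood at $\bar\alpha$, and bounds on the change in its second derivative in a neighborhood
of $\bar\alpha$. The bounds in Lemmas 1-5 are straightforward to verify, and thus their proofs are omitted.

LEMMA 1. If $\alpha \in \mathcal{N}_r$, then $\varepsilon_{ij}$ and its first three partial derivatives are bounded in absolute value by $C_0\, \bar p_{ij}\, r$.

LEMMA 2. If $\alpha \in \mathcal{N}_r$, then $p_{ij} \le P_0\, \bar p_{ij}$, where $P_0 = P_0(r) = r \exp(C_0 \varepsilon_0 r)$. Furthermore,

$$\Bigl|\frac{\partial p_{ij}}{\partial \alpha_k}\Bigr| \le P_1\, \bar p_{ij}, \qquad
\Bigl|\frac{\partial^2 p_{ij}}{\partial \alpha_k \partial \alpha_l}\Bigr| \le P_2\, \bar p_{ij}, \qquad
\Bigl|\frac{\partial^3 p_{ij}}{\partial \alpha_k \partial \alpha_l \partial \alpha_m}\Bigr| \le P_3\, \bar p_{ij},$$

where $P_1 = P_0 \cdot (1 + C_0 \varepsilon_0 r)$, $P_2 = P_0 \cdot \{(1 + C_0 \varepsilon_0 r)^2 + C_0 \varepsilon_0 r\}$, and
$P_3 = P_0 \cdot \{(1 + C_0 \varepsilon_0 r)^3 + 3 (1 + C_0 \varepsilon_0 r)\, C_0 \varepsilon_0 r + C_0 \varepsilon_0 r\}$.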

Lemma 3.1fa£ Af r , then 

1- \fij\ < FoPij and\fij\ < F Q pij, where F = C Q r - 



da k da k dai 
C e r}, and P 3 = P ■ {(1 + C £ r) 3 



2. 



dfij 



< F\ pij and 



dfi, 



< Pi Pij , where F\ — C^r + Eq 



i-p o£o and F = C r exp(C £ r) + P ; 
2 

' : and F\ = 



cxp(C £ r){C r + C F a e Q r + Pi}; 



1-P e 



3. 



d 2 h 3 

da k dai 



< F 2 pij, where F 2 = C r + 2e 2 (1^) + 3e ° + T 



Pa 



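LEMMA 4. The following approximations for the derivatives of the log-likelihood at $\bar\alpha$ hold:

1. $\Bigl|\dfrac{\partial \ell}{\partial \alpha_k}(\bar\alpha)\Bigr| \le L_1 X_{k+}\, \varepsilon_0$, where $L_1 = 1 + F_0(1) + \tilde F_0(1)$;

2. $\Bigl|\dfrac{\partial^2 \ell}{\partial \alpha_k \partial \alpha_l}(\bar\alpha) + \bar p_{kl}\Bigr| \le L_2 X_{kl}\, \varepsilon_0 + \tilde L_2\, \bar p_{kl}\, \varepsilon_0$, where $L_2 = F_1(1)$ and $\tilde L_2 = \tilde F_0(1) + \tilde F_1(1)$,
provided $k \neq l$;

3. $\Bigl|\dfrac{\partial^2 \ell}{\partial \alpha_k^2}(\bar\alpha) + X_{k+} + \bar p_{kk}\Bigr| \le L_3 X_{k+}\, \varepsilon_0$, where $L_3 = 2 + L_2 + \tilde L_2$.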



Lemma 5. If a, a' e Af r , then 



1. 



2. 



d£k~S a )- dSk( a ')\ ^ {MiX kl + Mip H )\\a' -aWoc, where M 1 =F 2 e and M x = 
2r (1 + F + Fx), provided k ^ I; 



(a) 



(a') < M 2 X k+ \\a' - a||oo, where M 2 = M x + M x . 
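Lemma 6 bounds the size of $E$, the relative error in approximating $\nabla^2 \ell(\bar\alpha)$ by $H$, and the remaining
Lemmas 7-9 verify that the necessary hypotheses are satisfied in order to apply Kantorovich's Theorem
to bound the error in Newton's method.

LEMMA 6. $\|E\|_\infty \le B \varepsilon_0$, where $B = (3/2)(L_2 + \tilde L_2 + L_3)$.

Proof. Matrix $E$ is given by $E = H^{-1}(\nabla^2 \ell(\bar\alpha) - H)$. In light of Lemma 4 and the triangle inequality,

$$|E| \le \Bigl(D^{-1} + \frac{1}{2 X_{++}}\, 1 1^{\mathrm T}\Bigr) \Bigl(L_2 \varepsilon_0 X + \tilde L_2 \varepsilon_0 \frac{1}{X_{++}}\, d d^{\mathrm T} + L_3 \varepsilon_0 D\Bigr)
= \varepsilon_0 \Bigl(\frac{L_2 + 3 \tilde L_2 + L_3}{2 X_{++}}\, 1 d^{\mathrm T} + L_2 D^{-1} X + L_3 I\Bigr).$$

Thus, $\|E\|_\infty \le (3/2)\, \varepsilon_0 (L_2 + \tilde L_2 + L_3) = B \varepsilon_0$. □

LEMMA 7. If $B \varepsilon_0 < 1$, then $\|[D^{-1} \nabla^2 \ell(\bar\alpha)]^{-1}\|_\infty \le \kappa$, where $\kappa = (3/2)/(1 - B \varepsilon_0)$.

Proof. This follows from Lemma 6 and Lemma 2.3.3 in Golub & Van Loan (1996), with the bound

$$\|[D^{-1} \nabla^2 \ell(\bar\alpha)]^{-1}\|_\infty \le \|(I + E)^{-1}\|_\infty\, \|H^{-1} D\|_\infty.$$

LEMMA 8. If $B \varepsilon_0 < 1$, then $\|[\nabla^2 \ell(\bar\alpha)]^{-1} [\nabla \ell(\bar\alpha)]\|_\infty \le \delta$, where $\delta = L_1 \kappa \varepsilon_0$.

Proof. In light of Lemma 4, we know that $|\nabla \ell(\bar\alpha)| \le L_1 \varepsilon_0\, d$. The result now follows after bounding

$$\|[\nabla^2 \ell(\bar\alpha)]^{-1} [\nabla \ell(\bar\alpha)]\|_\infty \le \|[D^{-1} \nabla^2 \ell(\bar\alpha)]^{-1}\|_\infty\, \|D^{-1} [\nabla \ell(\bar\alpha)]\|_\infty.$$

LEMMA 9. If $\alpha, \alpha' \in \mathcal{N}_r$, then

$$\|D^{-1} [\nabla^2 \ell(\alpha)] - D^{-1} [\nabla^2 \ell(\alpha')]\|_\infty \le \lambda\, \|\alpha - \alpha'\|_\infty,$$

where $\lambda = 2 M_2$.

Proof. This is a direct consequence of Lemma 5. □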

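Proof of Corollary 1

We aim to show that $|\hat p_{ij} - \bar p_{ij}| / \bar p_{ij} \le 24 (C_0 + 1)\, \varepsilon_0$ under the conditions of Theorem 1. Write

$$\frac{|\hat p_{ij} - \bar p_{ij}|}{\bar p_{ij}} = \bigl|\exp\{(\hat\alpha_i - \bar\alpha_i) + (\hat\alpha_j - \bar\alpha_j) + \varepsilon_{ij}(\hat\alpha_i, \hat\alpha_j)\} - 1\bigr|.$$

Using the bound $\bar\alpha_i \le (1/2) \log \varepsilon_0$ from the hypothesis of Theorem 1, along with the result of the theorem
that $\|\hat\alpha - \bar\alpha\|_\infty \le C \varepsilon_0$, we may write

$$\hat\alpha_i \le \tfrac{1}{2} \log \varepsilon_0 + C \varepsilon_0.$$

By Assumption 1 and the above, it thus follows that

$$|\varepsilon_{ij}(\hat\alpha_i, \hat\alpha_j)| \le C_0 \exp(\hat\alpha_i + \hat\alpha_j) \le C_0\, \varepsilon_0 \exp(2 C \varepsilon_0).$$

Now, since $\varepsilon_0 \le \bar\varepsilon_0 = \{15 (C_0 + 1)\}^{-2}$ and with $C = 10 (C_0 + 1)$, we obtain after simplification that

$$\bigl|(\hat\alpha_i - \bar\alpha_i) + (\hat\alpha_j - \bar\alpha_j) + \varepsilon_{ij}(\hat\alpha_i, \hat\alpha_j)\bigr| \le 21.1\, \varepsilon_0 (C_0 + 1) \le \log 1.1.$$

The result then follows by using the bound $|e^x - 1| \le |x e^x|$.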



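Proof of Corollary 2

Corollary 2 claims $|\hat\ell - \bar\ell| / |\bar\ell| \le 49 (C_0 + 1)\, \varepsilon_0$. From Corollary 1 we obtain the bounds

$$|\log \hat p_{ij} - \log \bar p_{ij}| \le C_1 \varepsilon_0, \qquad
|\log(1 - \hat p_{ij}) - \log(1 - \bar p_{ij})| \le C_1 \frac{\varepsilon_0}{1 - \varepsilon_0}\, \bar p_{ij},$$

with $C_1 = 24 (C_0 + 1)$. Therefore,

$$|\hat\ell - \bar\ell| \le C_1 \varepsilon_0 \sum_{i<j} X_{ij} + C_1 \frac{\varepsilon_0}{1 - \varepsilon_0} \sum_{i<j} (1 - X_{ij})\, \bar p_{ij} \le C_1 \frac{\varepsilon_0}{1 - \varepsilon_0}\, X_{++}.$$

Now write

$$|\bar\ell| = -\sum_{i=1}^{n} \bar\alpha_i X_{i+} - \sum_{i<j} (1 - X_{ij}) \log(1 - \bar p_{ij})
= \frac{X_{++}}{2} \log X_{++} - \sum_{i=1}^{n} X_{i+} \log X_{i+} - \sum_{i<j} (1 - X_{ij}) \log(1 - \bar p_{ij})$$

$$\ge \frac{1}{2} \sum_{i=1}^{n} X_{i+} \log \frac{X_{++}}{X_{i+}^2} \ge \frac{X_{++}}{2} \log \frac{1}{\varepsilon_0} \ge \frac{X_{++}}{2}.$$

Finally, putting the two bounds together and using that $\varepsilon_0 \le \bar\varepsilon_0 \le 15^{-2}$, we get

$$\frac{|\hat\ell - \bar\ell|}{|\bar\ell|} \le 2 C_1 \frac{\varepsilon_0}{1 - \varepsilon_0} \le 49 (C_0 + 1)\, \varepsilon_0.$$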